Introduction to program R


data science diagram

Re-thinking data


How we’ve interacted with data dictates how we structure data mentally.

Screenshot of microsoft excel window

Today’s goals


  • Value: Any datum (single unit of data)
  • Object: Container for holding values
  • Indexing: Querying objects by position
  • Logic: Querying objects using logical operators

Values


A value is defined by its:

  • State
  • Attributes (metadata), e.g.:
    • Name
    • Class
  • Context

Symbols representing different data values

Values


Values are distinguished by similarities and differences in their multi-dimensional states, attributes, and contexts.

Symbols representing different data values, with types of values arranged by row

Classes of values


In R, the most commonly used types of values are:

  • Numbers: numeric (double) and integer values
  • Character: strings of symbols (letters, digits, punctuation)
  • Factor: categorical values, stored as integer codes paired with character levels
  • Logical: the constants TRUE and FALSE, which behave as 1 and 0 in arithmetic

Classes of values: Numbers


Number values can be either:

  • Numeric: double precision floating point numbers (on the user end this can be thought of as a decimal number)
  • Integer: whole numbers
# Create a vector of numeric values:

numericV <-
  c(3, 2, 1, 1)

numericV

Classes of values: Numbers


Number values can be either:

  • Numeric: double precision floating point numbers (on the user end this can be thought of as a decimal number)
  • Integer: whole numbers
# What type of object is this?

class(numericV)

str(numericV)

summary(numericV)

Classes of values: Numbers


Number values can be either:

  • Numeric: double precision floating point numbers (on the user end this can be thought of as a decimal number)
  • Integer: whole numbers
# Create a vector of numeric integer values:

numericInteger <- 
  1:5

numericInteger

# What type of object is this?

class(numericInteger)

str(numericInteger)

summary(numericInteger)
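
As a small sketch of the numeric/integer distinction above: whole numbers typed without a suffix are still stored as doubles, while the `L` suffix or the `:` operator produces integers. The object names `wholeNumbers` and `trueIntegers` are hypothetical and used only for illustration.

```r
# Whole numbers typed without a suffix are stored as doubles ("numeric"):
wholeNumbers <- c(1, 2, 3)
class(wholeNumbers)

# The L suffix (or the : operator) produces integers:
trueIntegers <- c(1L, 2L, 3L)
class(trueIntegers)
class(1:3)

# The two vectors hold equal values but are stored differently:
wholeNumbers == trueIntegers
identical(wholeNumbers, trueIntegers)
```

For most day-to-day work the difference is invisible, but it matters when two objects are compared with `identical()`.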

Classes of values: Character


A character or “string” value is a sequence of symbols (letters, digits, punctuation) drawn from a given character set and written inside quotes.

# Create a vector of character values:

exampleCharacter <- 
  c('three', 'two', 'one', 'one')

exampleCharacter

# What type of object is this?

class(exampleCharacter)

str(exampleCharacter)

summary(exampleCharacter)

Classes of values: Factor


A factor value includes the following information:

  • Integer codes: The integer stored for each value, giving the position of its level
  • Levels: The set of unique character values that the integer codes refer to
  • Labels: Optional replacement names assigned to the levels when the factor is created
# Create a vector of factor values:

exampleFactor <- 
  factor(
    c('three', 'two', 'one', 'one'))

exampleFactor

# What type of object is this?

class(exampleFactor)

str(exampleFactor)

summary(exampleFactor)
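
To make the integer codes and levels visible, the factor can be taken apart with `levels()`, `as.integer()`, and `as.character()`. This is a minimal sketch that rebuilds the same `exampleFactor` as above.

```r
exampleFactor <-
  factor(c('three', 'two', 'one', 'one'))

# Levels are the unique values, sorted alphabetically by default:
levels(exampleFactor)        # "one" "three" "two"

# Each value is stored as the position of its level:
as.integer(exampleFactor)    # 2 3 1 1

# Converting back to character recovers the original strings:
as.character(exampleFactor)  # "three" "two" "one" "one"
```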

Classes of values: Factor


Barplot with three bars representing factor levels

Classes of values: Factor


A factor value includes the following information:

  • Integer codes: The integer stored for each value, giving the position of its level
  • Levels: The set of unique character values that the integer codes refer to
  • Labels: Optional replacement names assigned to the levels when the factor is created
# Set factor levels:

factor(
  c('three', 'two', 'one', 'one'))

factor(
  c('three', 'two', 'one', 'one'),
  levels = c('one', 'two', 'three')
  )

Classes of values: Factor


Barplot with three bars representing factor levels

Classes of values: Factor


A factor value includes the following information:

  • Integer codes: The integer stored for each value, giving the position of its level
  • Levels: The set of unique character values that the integer codes refer to
  • Labels: Optional replacement names assigned to the levels when the factor is created
# Set factor levels and labels:

factor(
  c('three', 'two', 'one', 'one'))

factor(
  c('three', 'two', 'one', 'one'),
  levels = c('one', 'two', 'three'),
  labels = c('One', 'Two', 'Three')
  )
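
The effect of `levels` and `labels` can be checked by inspecting the result. In this sketch the name `labelledFactor` is hypothetical; note that once labels are supplied, they become the factor's levels.

```r
labelledFactor <-
  factor(
    c('three', 'two', 'one', 'one'),
    levels = c('one', 'two', 'three'),
    labels = c('One', 'Two', 'Three')
  )

# The labels replace the original levels, in the order given:
levels(labelledFactor)       # "One" "Two" "Three"

# The integer codes now follow that order rather than the alphabet:
as.integer(labelledFactor)   # 3 2 1 1

# Counts per level, which is what a bar plot of the factor displays:
table(labelledFactor)
```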

Classes of values: Factor


Barplot with three bars representing factor levels

Classes of values: Logical


R reserves the words TRUE and FALSE as logical constants. When used in arithmetic, these constants are coerced to integer values:

  • FALSE: 0
  • TRUE: 1
# Observe the behavior of logical values:

FALSE

TRUE

as.numeric(FALSE)

as.numeric(TRUE)

FALSE + TRUE

FALSE + TRUE + TRUE
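
Because TRUE counts as 1 and FALSE as 0, summing a logical vector counts its TRUE values and averaging it gives their proportion. A minimal sketch, using a hypothetical vector named `logicalVector`:

```r
logicalVector <- c(TRUE, FALSE, TRUE, TRUE)

class(logicalVector)   # "logical"

# Number of TRUE values:
sum(logicalVector)     # 3

# Proportion of TRUE values:
mean(logicalVector)    # 0.75
```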

Classes of values: Logical


Logical values can be obtained by evaluating objects with logical operators. For example, the logical operator == tests whether a value is equal to another value.

# The "is equal to" logical operator:

3 == 3

3 == 4

3 == 2 + 1

3 == 3 + 1

(3 == 3) + (3 == 2 + 1)
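
The same pattern applies to R's other comparison operators, and logical values can be combined with `&` (and), `|` (or), and `!` (not). The sketch below is illustrative only and uses nothing beyond base R.

```r
# Other comparison operators:
3 != 4                 # TRUE
3 < 4                  # TRUE
3 >= 4                 # FALSE

# Combining logical values:
(3 < 4) & (3 == 4)     # FALSE
(3 < 4) | (3 == 4)     # TRUE
!(3 == 4)              # TRUE

# Comparisons are applied to every value in a vector:
c(1, 1, 2, 3) == 1     # TRUE TRUE FALSE FALSE
```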

Objects: Containers for values


In R, containers called objects structure collections of values. Different types of objects store values in different ways:


Object dimensions    Homogeneous class    Heterogeneous class
1-D                  Atomic vector        List
2-D                  Matrix               Data frame
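
As a rough sketch of the table, one object of each type can be built and checked with base R's `is.*()` functions. All object names here (`sketchVector`, `sketchMatrix`, `sketchList`, `sketchDf`) are hypothetical.

```r
# One example object per cell of the table:
sketchVector <- c(1, 1, 2, 3)                             # 1-D, homogeneous
sketchMatrix <- matrix(sketchVector, ncol = 2)            # 2-D, homogeneous
sketchList   <- list(1, 'one', TRUE)                      # 1-D, heterogeneous
sketchDf     <- data.frame(a = c(1, 1), b = c('x', 'y'))  # 2-D, heterogeneous

is.vector(sketchVector)
is.matrix(sketchMatrix)
is.list(sketchList)
is.data.frame(sketchDf)

# One-dimensional objects have a length but no dim attribute:
length(sketchVector)   # 4
dim(sketchVector)      # NULL

# Two-dimensional objects report rows and columns:
dim(sketchMatrix)      # 2 2
dim(sketchDf)          # 2 2
```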

Objects: Containers for values


For each object type, we’ll address:

  • Structure
  • Indexing
  • Attributes

Vector objects: Structure


An atomic vector is a one-dimensional collection of values. All values must be of the same class.

Each value in a vector has a position, denoted by “[x]”


[1] [2] [3] [4]
1 1 2 3

Vector objects: Structure


An atomic vector is a one-dimensional collection of values. All values must be of the same class.

# A vector of numeric values:

numericVector <- 
  c(1, 1, 2, 3)

numericVector
## [1] 1 1 2 3
summary(numericVector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    1.50    1.75    2.25    3.00

Vector objects: Structure


An atomic vector is a one-dimensional collection of values. All values must be of the same class.

# All values in a vector must be of the same class:

numericVector
## [1] 1 1 2 3
messyVector <- 
  c(1, 'one', 2, 3)

messyVector
## [1] "1"   "one" "2"   "3"
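
When classes are mixed, R coerces every value to the most flexible class present, in the order logical, integer, numeric, character. The following sketch illustrates that ordering, and what happens when a string that is not a number is converted back.

```r
# Coercion moves toward the most flexible class present:
class(c(TRUE, 1L))      # "integer"
class(c(1L, 1.5))       # "numeric"
class(c(1.5, 'one'))    # "character"

# Strings that look like numbers convert back cleanly:
as.numeric(c('1', '2'))

# Strings that do not become NA (R also issues a warning, suppressed here):
suppressWarnings(as.numeric('one'))
```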

Vector objects: Indexing


Each value in a vector has a position, denoted by “[x]”

[1] [2] [3] [4]
1 1 2 3
# Use indexing to subset a vector:

numericVector

numericVector[3]

numericVector[3:4]

numericVector[c(1,3)]
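
Besides positions, a vector can also be indexed with negative numbers (to drop values) or with a logical vector (to keep values where a test is TRUE). A minimal sketch, rebuilding `numericVector` so that it runs on its own:

```r
numericVector <- c(1, 1, 2, 3)

# Negative positions drop values:
numericVector[-1]                  # 1 2 3

# A logical test returns one TRUE/FALSE per value...
numericVector > 1                  # FALSE FALSE TRUE TRUE

# ...which can be used to keep only the values that pass:
numericVector[numericVector > 1]   # 2 3

# which() reports the positions that pass instead:
which(numericVector > 1)           # 3 4
```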

Vector objects: Attributes


The vector attributes we are typically interested in include:

  • Class: What type of values?
  • Length: How many values?
# Attributes of the vector:

class(numericVector)

length(numericVector)

str(numericVector)

Vector objects: Attributes


Attributes can be added to vectors.


# Adding attributes to a vector:

numericVector

names(numericVector)

names(numericVector) <- 
  c('orange', 'pear', 'apple', 'apple')

Vector objects: Attributes


Vectors can be indexed by their names attribute.

['orange']  ['pear']  ['apple']  ['apple']
    1          1          2          3
numericVector[2]

numericVector['pear']

numericVector[2] == numericVector['pear']

numericVector[c('orange', 'pear')]

Matrix objects: Structure


A matrix is a two-dimensional object, essentially a vector that has been arranged into multiple columns. All values must be of the same class.

Values in a matrix have a row (x) and column (y) position, denoted by “[x, y]”


[ ,1] [ ,2]
[1, ] 1 2
[2, ] 1 3

Matrix objects: Structure


A matrix is a two-dimensional object, essentially a vector that has been arranged into multiple columns. All values must be of the same class.

# Generate matrix:

m <- matrix(c(1, 1, 2, 3), ncol = 2)

m
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3

Matrix objects: Structure


A vector can be structured horizontally (row-wise) or vertically (column-wise) within a matrix:

# Compare matrices built row-wise and column-wise:

matrix(
  c(1, 1, 2, 3),
  ncol = 2, 
  byrow = TRUE)
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    3
matrix(
  c(1, 1, 2, 3),
  ncol = 2, 
  byrow = FALSE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3

Matrix objects: Structure


Because matrices must be homogeneous, all values are forced to be the same type.

# Matrix built with multiple types:

messyMatrix <- 
  matrix(
    c(1, 'one', 2, 3),
    ncol = 2)

messyMatrix
##      [,1]  [,2]
## [1,] "1"   "2" 
## [2,] "one" "3"

Matrix objects: Indexing


Values in a matrix have a row (x) and column (y) position, denoted by “[x, y]”

[ ,1] [ ,2]
[1, ] 1 2
[2, ] 1 3
# Index by row (x) and column (y) position [x,y]:

m[1,1]

m[2,2]

m[1:2,2]
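
Leaving the row or column position blank selects everything along that dimension. The sketch below rebuilds `m` so that it runs on its own; `drop = FALSE` is a base R argument that keeps the result as a matrix instead of simplifying it to a vector.

```r
m <- matrix(c(1, 1, 2, 3), ncol = 2)

# A blank position means "all rows" or "all columns":
m[1, ]                      # first row: 1 2
m[, 2]                      # second column: 2 3

# By default a single row or column is simplified to a vector;
# drop = FALSE keeps the two-dimensional structure:
dim(m[, 2, drop = FALSE])   # 2 1

# A logical test can also be used as an index:
m[m > 1]                    # 2 3
```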

Matrix objects: Attributes


Several attributes of a matrix can be inspected:

# View matrix attributes:

class(m)

length(m)

dim(m)

str(m)

summary(m)

Matrix objects: Attributes


You may add a name attribute to rows and columns.

# Naming rows and columns:

colnames(m) <- 
  c('a', 'b')

rownames(m) <- 
  c('c', 'd')

attributes(m)

m
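
Once rows and columns are named, those names can be used in place of positions. A minimal sketch, repeating the setup above so that it is self-contained:

```r
m <- matrix(c(1, 1, 2, 3), ncol = 2)
colnames(m) <- c('a', 'b')
rownames(m) <- c('c', 'd')

# Index by row and column name:
m['d', 'b']    # 3

# Names and positions can be mixed:
m[2, 'b']      # 3

# A blank position still means "everything":
m[, 'a']       # values 1 and 1, named c and d
```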

List objects: Structure


A list is a one-dimensional object that can hold objects of any class and any dimensionality.

List position is denoted by [[x]].

[[1]]

[1] [2] [3] [4]
1 1 2 3

[[2]]

[ ,1] [ ,2]
[1, ] 1 2
[2, ] 1 3

[[3]]

[ ,1] [ ,2]
[1, ] “1” “2”
[2, ] “one” “3”

List objects: Structure


A list is a one-dimensional object that can hold objects of any class and any dimensionality.

# List of a numeric vector and matrices:

exampleList <- 
  list(numericVector, m, messyMatrix)

exampleList
## [[1]]
## [1] 1 1 2 3
## 
## [[2]]
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
## 
## [[3]]
##      [,1]  [,2]
## [1,] "1"   "2" 
## [2,] "one" "3"

List objects: Indexing


A list is a one-dimensional object that can hold objects of any class and any dimensionality.

List position is denoted by [[x]].

# List indexing:

exampleList
## [[1]]
## [1] 1 1 2 3
## 
## [[2]]
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
## 
## [[3]]
##      [,1]  [,2]
## [1,] "1"   "2" 
## [2,] "one" "3"
exampleList[[2]]
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3

List objects: Indexing


A list is a one-dimensional object that can hold objects of any class and any dimensionality.

List position is denoted by [[x]].

# List indexing:

exampleList[[2]]

exampleList[[2]] == m

m[2,2]

exampleList[[2]][2,2]
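
Single brackets behave differently from double brackets on a list: `[[x]]` extracts the item itself, while `[x]` returns a shorter list that still wraps the item. The sketch below uses a hypothetical two-item list named `shortList` so that it runs on its own.

```r
shortList <-
  list(
    c(1, 1, 2, 3),
    matrix(c(1, 1, 2, 3), ncol = 2))

# Double brackets extract the item itself:
class(shortList[[1]])     # "numeric"

# Single brackets return a list containing the item:
class(shortList[1])       # "list"

# Single brackets can therefore select several items at once:
length(shortList[1:2])    # 2

# Indexing can be chained: third value of the first item:
shortList[[1]][3]         # 2
```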

List objects: Attributes


The list attributes we are typically interested in include:

  • Class: What type of object?
  • Length: How many items?
# Attributes of a list:

class(exampleList)

length(exampleList)

str(exampleList)

List objects: Attributes


The list attributes we are typically interested in include:

  • Class: What type of object?
  • Length: How many items?
# Attributes of list items:

class(exampleList[[1]])

length(exampleList[[1]])

List objects: Attributes


Attributes can be added to lists.


# Adding attributes to a list:

exampleList

names(exampleList)

names(exampleList) <- 
  c('numericVector', 'm', 'messyMatrix')

attributes(exampleList)

List objects: Attributes

Lists can be indexed by their names attribute using double-bracket notation or the $ operator.

# Index a list by position, by name, or with the $ operator:

exampleList[[3]]

exampleList[['messyMatrix']]

exampleList$messyMatrix

Data frame objects: Structure


A data frame is a two-dimensional object constructed by combining vectors of equal length, one per column.

Each value in a data frame has a row and column position, denoted by “[x, y]”


[ ,1] [ ,2]
[1, ] 1 2
[2, ] 1 3

Data frame objects: Structure


A data frame is a two-dimensional object constructed by combining vectors of equal length, one per column.

# Generate a data frame:

df <- 
  data.frame(a = c(1, 1), b =  c(2, 3))

df
##   a b
## 1 1 2
## 2 1 3

Data frame objects: Structure


The vectors that are contained in a data frame may be of different classes.

# Generate a data frame of different vector classes:

data.frame(
  a = c('one', 'one'),
  b =  c(2, 3))
##     a b
## 1 one 2
## 2 one 3

Data frame objects: Structure


But the values within each vector (column) are still coerced to a single class!

# Attempt to generate a data frame with a mixed-class vector:

messyDf <- data.frame(
  a = c(1, 'one'),
  b =  c(2, 3))

messyDf
##     a b
## 1   1 2
## 2 one 3

Data frame objects: Indexing


Values in a data frame have a row (x) and column (y) position, denoted by “[x, y]”


[ ,1] [ ,2]
[1, ] 1 2
[2, ] 1 3
# Index by row (x) and column (y) position [x,y]:

df[1,1]

df[2,2]

df[1:2,2]
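
As with matrices, a blank position selects a whole row or column, and columns can be referred to by name. Rows can also be selected with a logical test on a column. A minimal sketch, rebuilding `df` so that it runs on its own:

```r
df <-
  data.frame(a = c(1, 1), b = c(2, 3))

# Whole row (returned as a one-row data frame):
df[1, ]

# Whole column, by position or by name (returned as a vector):
df[, 2]             # 2 3
df[, 'b']           # 2 3

# Rows where a test on a column is TRUE:
df[df$b > 2, ]

# Rows and columns can be selected together:
df[df$b > 2, 'a']   # 1
```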

Data frame objects: Attributes


Several attributes of a data frame can be inspected:

# View data frame attributes:

str(df)

class(df)

length(df)

dim(df)

summary(df)

Data frame objects: Attributes


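Always check attributes prior to working with a data frame! Note that the factor column in the first result reflects R versions before 4.0.0, where data.frame() converted character vectors to factors by default; from R 4.0.0 onward the default is stringsAsFactors = FALSE.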

# View attributes of the messy dataframe:

str(messyDf)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: Factor w/ 2 levels "1","one": 1 2
##  $ b: num  2 3
dfStrings <- data.frame(
  a = c(1, 'one'), 
  b =  c(2, 3),
  stringsAsFactors = FALSE
  )

str(dfStrings)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: chr  "1" "one"
##  $ b: num  2 3
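
One quick way to check the class of every column is `sapply()` with `class`. The sketch below also shows a common pitfall when a column of numbers has been stored as a factor: `as.numeric()` returns the integer codes, not the original numbers. The name `numberFactor` is hypothetical.

```r
dfStrings <- data.frame(
  a = c(1, 'one'),
  b = c(2, 3),
  stringsAsFactors = FALSE
  )

# Class of each column:
sapply(dfStrings, class)

# Pitfall: numbers stored as a factor convert to their integer codes...
numberFactor <- factor(c('10', '20'))
as.numeric(numberFactor)                 # 1 2

# ...so convert to character first to recover the values:
as.numeric(as.character(numberFactor))   # 10 20
```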

Data frame objects: Attributes


The names attribute is set automatically when a data frame is created. If you do not supply names yourself, R generates uninformative ones:

# Set and unset names:

data.frame(
  a = c(1, 1), 
  b =  c(2, 3))
##   a b
## 1 1 2
## 2 1 3
data.frame(
  c(1, 1),
  c(2, 3))
##   c.1..1. c.2..3.
## 1       1       2
## 2       1       3

Data frame objects: Attributes


Similar to other objects, the names attribute can also be set manually after an object is created:

# Set names after creating a data frame:

exampleDf <- 
  data.frame(
    c(1, 1),
    c(2, 3))

names(exampleDf) <- 
  c('hello', 'world')

exampleDf
##   hello world
## 1     1     2
## 2     1     3

Data frame objects: Attributes


Data frames can be indexed by their names attribute using bracket notation, which returns a one-column data frame, or the $ operator, which returns the column as a vector.

# Index a data frame by name:

exampleDf['hello']


exampleDf$hello
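
The different ways of indexing by name do not all return the same kind of object, which can be confirmed with `class()`. A minimal sketch, rebuilding `exampleDf` with its names set at creation so that it runs on its own:

```r
exampleDf <-
  data.frame(
    hello = c(1, 1),
    world = c(2, 3))

# Single brackets keep the data frame structure:
class(exampleDf['hello'])     # "data.frame"

# Double brackets, $, and [row, column] notation return the vector:
class(exampleDf[['hello']])   # "numeric"
class(exampleDf$hello)        # "numeric"
class(exampleDf[, 'hello'])   # "numeric"
```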

Data frame objects: The tibble!


A tibble is a special type of data frame provided by the tibble package, which is loaded as part of the tidyverse.

# Load the tidyverse package(s):

library(tidyverse)

# Generate a tibble data frame:

tibbleDf <- 
  tibble(
    a = c(1, 'one'),
    b =  c(2, 3))

tibbleDf
## # A tibble: 2 x 2
##   a         b
##   <chr> <dbl>
## 1 1        2.
## 2 one      3.

Data frame objects: The tibble!


Base R data frames can also be converted to a tibble.

# Convert a data frame to a tibble:

as_tibble(messyDf)
## # A tibble: 2 x 2
##   a         b
##   <fct> <dbl>
## 1 1        2.
## 2 one      3.
as_tibble(
  data.frame(
    a = c(1, 'one'), 
    b =  c(2, 3)))
## # A tibble: 2 x 2
##   a         b
##   <fct> <dbl>
## 1 1        2.
## 2 one      3.

Data frame objects: The tibble!


How do tibbles differ from Base R data frames?

# Compare tibble and base R data frame:

data.frame(
  a = c(1, 'one'),
  b =  c(2, 3))
##     a b
## 1   1 2
## 2 one 3
tibble(
  a = c(1, 'one'),
  b =  c(2, 3))
## # A tibble: 2 x 2
##   a         b
##   <chr> <dbl>
## 1 1        2.
## 2 one      3.

Data frame objects: The tibble!


How do tibbles differ from Base R data frames?

# Load the mtcars data set:

data(mtcars)

mtcars

as_tibble(mtcars)
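
Two of the differences can be checked directly. This sketch assumes the tibble package is installed (it is skipped otherwise) and uses the hypothetical names `baseDf` and `tibbleVersion`. It shows that extracting a single column with `[ , ]` keeps a tibble as a tibble, and that row names are dropped on conversion.

```r
# Assumes the tibble package is installed; the sketch is skipped otherwise.
if (requireNamespace('tibble', quietly = TRUE)) {
  baseDf <- data.frame(a = c(1, 1), b = c(2, 3))
  tibbleVersion <- tibble::as_tibble(baseDf)

  # One column of a base data frame is simplified to a vector...
  print(class(baseDf[, 'a']))

  # ...but one column of a tibble is still a tibble:
  print(class(tibbleVersion[, 'a']))

  # mtcars stores car models as row names; the converted tibble has none:
  print(rownames(mtcars)[1:3])
  print(tibble::has_rownames(tibble::as_tibble(mtcars)))
}
```

Printing is the other visible difference: a tibble shows only the first rows along with each column's class, whereas a base R data frame prints every row.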

Summary


Symbols representing different data values
  • Values:
    • Numbers
    • Characters
    • Factors
    • Logical values

  • Objects:
    • Vectors
    • Matrices
    • Lists
    • Data frames

Summary


By next Wednesday, please complete this worksheet.